The effectiveness of automated writing evaluation in EFL/ESL writing: a three-level meta-analysis

Thuy Thi-Nhu Ngo, Howard Hao-Jan Chen and Kyle Kuo-Wei Lai

English Department, National Taiwan Normal University, Taipei City, Taiwan


The present study performs a three-level meta-analysis to investigate the overall effectiveness of automated writing evaluation (AWE) on EFL/ESL student writing performance. Twenty-four primary studies representing 85 between-group effect sizes and 34 studies representing 178 within-group effect sizes found from 1993 to 2021 were meta-analyzed separately. The results indicated a medium overall between-group effect size (g = 0.59) and a large overall within-group effect size (g = 0.98) of AWE on student writing performance. Analyses of moderators show that: (1) AWE is more effective in improving vocabulary usage but less effective in improving grammar in students' writing; (2) Grammarly shows potential as a highly effective tool, though Pigai did not demonstrate such effectiveness; (3) medium to long duration of AWE usage leads to a higher effect, but short duration leads to a lower effect in writing outcome compared to non-AWE treatment; (4) studying with peers in the AWE condition potentially produces a large effect; (5) AWE is beneficial to students at the undergraduate level, students in the EFL context, and students with intermediate English proficiency. Directions for future research are also discussed in the present study. Overall, AWE is a beneficial application and is recommended for integration in the writing classroom.


Received 24 February 2022

Accepted 23 June 2022


Automated writing evaluation; AWE; writing; meta-analysis; effectiveness


Writing is an important skill for students to succeed in school and daily life (Graham, 2019). To be a competent writer, it is necessary to master both basic processes (e.g. handwriting/typing, spelling) and complex processes (e.g. idea generation and organization, idea transformation into language, writing revision) in writing. This task requires effective instructional practices in the writing classroom so that students can develop their writing ability. One of the beneficial practices is the use of technology in assisting and scaffolding students with the aforementioned processes. With the development of recent technologies, it is now possible for students to receive technology-based feedback. Automated writing evaluation (AWE) is one of the most widely used software applications that provide feedback in writing (Nunes et al., 2021).

AWE is known as a computer tool that can automatically evaluate a written text by providing an overall score and/or feedback on categories such as grammar, mechanics, content, organization, vocabulary usage, or style (Warschauer & Ware, 2006). Originally, it was developed for the purpose of giving summative scores for written texts. In the last decades, the tool has matured and can now provide detailed automated feedback (Nunes et al., 2021). As a result, the application of AWE has gained popularity in school and university settings as students are provided with more opportunities to plan, write, and revise written texts with the help of AWE feedback (Cotos, 2014; Grimes & Warschauer, 2010). In EFL/ESL writing classrooms where a large class size is often the norm, the tool can also help reduce teachers' workloads through its capacity to provide individual feedback across multiple drafts (Chen et al., 2017; Warschauer & Ware, 2006).

The purpose of the present study is thus to explore the overall effectiveness of automated writing evaluation (AWE) on EFL/ESL students' writing outcomes. The following sections provide brief background literature on the arguments for and against the use of AWE in writing classrooms and the need for conducting a meta-analysis on the present topic.

Arguments for and against AWE

The effectiveness of AWE on student writing performance has sparked much debate among researchers and educators (Hockly, 2019). On the one hand, researchers who promote the use of AWE consider the tool to hold three main benefits. First, the tool has the ability to evaluate student writing as appropriately as teachers do, but in a much more time- and cost-effective way (Cotos, 2011). Second, students' learning motivation and autonomy are facilitated through a learning environment that can support scaffolding practices throughout the multiple drafting process of AWE (Chen & Cheng, 2008; Cotos, 2011). Moreover, students can receive sufficient support from the tool when teachers are not available (Wang, 2013). Third, the integration of AWE into writing instruction is believed to provide a more consistent and objective assessment across the curriculum (Cotos, 2011). Meanwhile, human evaluation is flexible and limited based on student individual differences (Parra & Calero, 2019).

On the other hand, some researchers have raised several issues regarding AWE usage. One of these issues is the vagueness of the feedback, as AWE offers no concrete suggestions for students to improve their ability to present consistent, unified, and relevant messages in their writing (Lai, 2010). Furthermore, AWE feedback is predetermined by the computer programming, which limits its ability to provide rich negotiation of meaning and contributes less to the content development of the writing (Chen & Cheng, 2008). Another argument against AWE usage is that it discriminates against students who are less experienced with technology use (Khoii & Doroudian, 2013). These arguments against AWE usage should also be taken into account in writing instruction. However, a more important consideration is the overall effectiveness of AWE, which is examined in the present meta-analysis.

The need for a meta-analysis

Although there is evidence of its usefulness, the overall effectiveness of AWE is still underexplored. The most recent attempt to synthesize the studies on the effectiveness of AWE is the systematic review in Nunes et al.'s (2021) study. Their review offered much insight into the effectiveness and application of AWE. However, further investigation is still necessary to examine some issues that were not explored in their study. First of all, Nunes et al. (2021) solely reviewed the effectiveness of AWE in school settings (i.e. Grades 1–12), which were scarcely studied in the previous literature. If various education levels had been included in the review (e.g. university level), the study might have provided a more complete picture of the effectiveness of AWE. As it was, their review could only obtain eight studies, so the conclusiveness of the findings might be in question. Second, the authors included studies on L1 together with studies on L2 in their review. This combination might induce some problems when synthesizing the data, as writing in the L1 is different from writing in the L2 at the lexicon level (e.g. word formation, word choice), sentence level (e.g. sentence patterns, sentence subject), and passage level (e.g. the choice of writing topic, voice, organization) (Wang, 2012). Third, the synthesis of different writing-related measures (e.g. grammar, vocabulary, content, organization) was not discussed. The authors found that the collected studies showed a positive impact of AWE in at least one writing-related measure, but the overall impact of AWE could be more meaningful if each measure was stated. Finally, since theirs is a systematic review, quantitative findings such as the values of the effect sizes were not explored. As Nunes et al. (2021) aimed at investigating the effectiveness of AWE (i.e. the authors only searched for past studies that had an experimental design), this study intends to go further by conducting a meta-analysis that includes the values of effect sizes to better explore the effectiveness of AWE.

Another reason for conducting the present meta-analysis is the variation of the effect of AWE across different studies. When the overall writing scores were compared between the experimental group (i.e. the group with the use of AWE) and the control group (i.e. the group with traditional teaching), the calculated Hedges' g (i.e. a measure of effect size) was small and negative (g = −0.29) in Mørch et al. (2017), moderate and positive (g = 0.52) in Rich (2012), but large and positive (g > 3) in Hassanzadeh and Fotoohnejad (2021). Hedges' g values ranging from negative to positive were also observed in the sub-categories of writing such as grammar, content, or vocabulary among the primary studies (Gao & Ma, 2019; Huang & Renandya, 2020; Liu et al., 2017; Wang et al., 2013). An apparent explanation for the inconsistent findings was the heterogeneity among the studies, meaning that potentially different factors (i.e. study features such as outcome measures, AWE tools, duration, etc.) had caused such variation.

To tackle the aforementioned issues, a meta-analysis that can synthesize research findings and provide substantial evidence of effect sizes in the impact of AWE on writing outcomes is needed to understand the overall effectiveness of AWE, as well as to discern what may account for the variation observed in the effect. Therefore, the present study makes the first attempt to investigate the effectiveness of AWE on EFL/ESL student writing performance following the two research questions below:


Inclusion and exclusion criteria

The present study investigated the effectiveness of AWE in EFL/ESL student writing both in between-group and within-group comparisons. The former means the comparison of writing outcomes between the experimental groups (i.e. the groups that had assistance from the AWE tools to revise their writing) and the control groups (i.e. the groups that experienced the traditional teaching method such as receiving feedback from teachers and/or peers or self-revising the writing drafts without the assistance from the AWE tools). The latter was to compare student writing outcomes in their pre- and post-tests or their original and final drafts after experiencing the treatments from the AWE tools. In certain cases, a study attempting to meta-analyze two types of comparison could offer a more comprehensive picture of an investigated research topic (Lee et al., 2019). Therefore, the present meta-analysis attempted to explore the findings from both types of comparison. The inclusion and exclusion criteria for the collection of the primary studies were established as follows:

  1. The primary studies included in the meta-analysis should be quantitative studies that contained experimental groups and/or control groups.

  2. The effect of AWE on student writing performance was measured in the primary studies, and the parameters such as means, standard deviations, sample sizes, or other statistical values that could offer enough information to transform to the Hedges' g values must be provided.

  3. The writing performance of students should be assessed based on their writing products. Studies that reported self-perceived performance were excluded.

  4. The studies in which students in experimental groups also received independent feedback from teachers/peers without referring to the AWE tools were excluded unless there were control groups that could help eliminate the independent effect from teachers/peers. The studies that had students use the tools with teachers/peers could be included since the tools were the main source of providing feedback; however, these studies were coded differently as either learning with teachers or with peers.

  5. The AWE tools used in the studies were to analyze English writing texts. Tools that analyzed the writing texts of other languages (e.g. Dutch, Chinese) besides English were excluded.

  6. Finally, only studies written in English were included.

Literature search

In the first phase, the primary studies were collected through databases such as ProQuest, Wiley Open Library, Taylor & Francis Online, ERIC, and Google Scholar. Some popular SSCI journals related to language learning and technology were also accessed separately to reduce the likelihood of missing relevant studies, including Computer Assisted Language Learning, Innovation in Language Learning and Teaching, Journal of Computer Assisted Learning, British Journal of Educational Technology, Australasian Journal of Educational Technology, The Journal of the European Association for Computer Assisted Language Learning, CALICO Journal, and Language Learning and Technology Journal.

There were three main sets of keywords for paper screening: (set 1) AWE-related keywords (e.g. automated writing evaluation, AWE, automated, automated feedback, grammar checker, grammar check, English grammar checker, AI and grammar checker, grammar checker and feedback), (set 2) writing-performance-related keywords (e.g. writing outcome, writing performance, essay writing, essay, English writing, writing accuracy, writing skills, writing), and (set 3) design-related keywords (e.g. pretest, posttest, pre-test, post-test, experiment, experimental, control, quantitative, original, revised).

The search was conducted by using the keywords of the first set alone and/or in combination with the keywords in set 2 and set 3. For the keywords that produced several hundred to thousands of hits, we read through the first 100 hits or until no more potentially relevant papers were found. In total, approximately 11,000 potential hits (including duplicates) were quickly examined by their titles and/or abstracts, and nearly 450 potentially relevant papers (excluding duplicates) were identified for further examination. After applying all the inclusion and exclusion criteria, 38 primary studies qualified for the collection in the first phase.

The second phase of the paper collection was the reference chasing of the 38 previously collected studies. This phase added six more papers to the final list of the collection, constituting a total of 44 papers. Of these 44 papers, 24 papers were classified in the list of between-group comparison studies and 34 papers in the list of within-group comparison studies. The sum of the number of papers in the two lists was greater than 44 because, while some papers provided information on only one type of comparison, others provided both (24 + 34 − 44 = 14 papers appear in both lists).

Effect size calculation

The present meta-analysis analyzed the two datasets separately, namely between-group comparison and within-group comparison. The between-group comparison represented the difference between the experimental treatment and the traditional teaching, while the within-group comparison showed the difference between the treatment and no teaching. Theoretically speaking, a higher effect size would be expected in the within-group comparison because even subpar teaching in the traditional method would normally result in some improvement (Boulton & Cobb, 2017). It is, therefore, essential to separate the two comparisons. Each effect size was calculated using Hedges' g because it takes effect-size weighting into account for the small-sample cases included in the present meta-analysis. The equations for calculation are shown below:

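$$
\text{Hedges' } g = J_{\text{correction factor}} \times \frac{\mathrm{Mean}_T - \mathrm{Mean}_C}{\sqrt{\dfrac{(n_T - 1)\,SD_T^2 + (n_C - 1)\,SD_C^2}{n_T + n_C - 2}}}
$$

$$
SE_g = J_{\text{correction factor}} \times \sqrt{\frac{1}{n_T} + \frac{1}{n_C} + \frac{(\text{Cohen's } d)^2}{2 \times (n_T + n_C)}}
$$

in which

$$
J_{\text{correction factor}} = 1 - \frac{3}{4 \times (n_T + n_C - 2) - 1}; \qquad
\text{Cohen's } d = \frac{\mathrm{Mean}_T - \mathrm{Mean}_C}{\sqrt{\dfrac{(n_T - 1)\,SD_T^2 + (n_C - 1)\,SD_C^2}{n_T + n_C - 2}}}
$$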

Mean_T, n_T, and SD_T respectively represent the mean, sample size, and standard deviation of the treated group; Mean_C, n_C, and SD_C respectively represent the mean, sample size, and standard deviation of the comparison group (i.e. either the control group in between-group comparison or the pre-treated group in within-group comparison). See also Boulton and Cobb (2017), Hedges and Olkin (1985), Lee et al. (2019), and Lipsey and Wilson (2001) for references.
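The Hedges' g formulas in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code (the study itself used R): the function name `hedges_g` and the example scores are hypothetical, and the pooled standard deviation follows the Cohen's d definition given above.

```python
import math


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, se_g) for a treated group (T) and a comparison group (C)."""
    # Pooled standard deviation across the two groups
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )
    # Cohen's d: standardized mean difference
    d = (mean_t - mean_c) / pooled_sd
    # Small-sample correction factor J
    j = 1 - 3 / (4 * (n_t + n_c - 2) - 1)
    g = j * d
    # Standard error of g
    se_g = j * math.sqrt(1 / n_t + 1 / n_c + d ** 2 / (2 * (n_t + n_c)))
    return g, se_g


# Hypothetical post-test scores: AWE group vs. control group, 30 students each
g, se_g = hedges_g(mean_t=80.0, sd_t=10.0, n_t=30, mean_c=75.0, sd_c=10.0, n_c=30)
print(f"g = {g:.3f}, SE = {se_g:.3f}")
```

For a within-group effect size, the same function would be called with the post-test statistics as the treated group and the pre-test statistics as the comparison group.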

One point of debate is the choice between comparing post-test effect sizes and comparing learning gain effect sizes. Calculating learning gain effect sizes is still a controversial issue because primary studies often do not report the standard deviations of the gains. Cuijpers et al. (2017) and Harrer et al. (2021) suggested that it is best practice to avoid calculating learning gains for meta-analysis.

Lee et al. (2019) generated learning gain effect sizes, but only for about half of their collected studies, because not all studies provided enough information; the same was true for the present meta-analysis. In order to generate learning gain effect sizes, those authors used the post-test standard deviation of the control group as a substitute for the missing learning gain standard deviations. However, in real data, learning gain standard deviations are highly likely to differ between the groups. In our data, for example, Choi (2011) and Liu et al. (2017) reported different learning gain standard deviations for the examined groups. Moreover, Lee et al. (2019) also reported that the difference between the gain effect sizes and the post-test effect sizes was relatively small.

After all of the aforementioned considerations, we decided not to generate learning gain effect sizes, as doing so would not significantly affect the power of the overall effect sizes. Using the original post-test or learning gain parameters reported in the primary studies would simplify the data management process for any replication and avoid the unresolved controversy over calculating learning gain effect sizes for a meta-analysis.


Research on the effectiveness of the AWE tools usually compares a number of different variables and thus produces several effect sizes within the same study. For example, Huang and Renandya (2020) compared the writing outcomes of the experimental group with the control group in their overall writing score and five writing subcategories including content, organization, vocabulary, language use, and mechanics. In another example, Liu et al. (2017) evaluated student writing performance in terms of seven variables (i.e. spelling, grammar, coherence, conclusion, supporting ideas, sentence diversity, and organization). This common occurrence in AWE research requires researchers to employ a more rigorous analysis method to deal with effect size dependence when conducting a meta-analysis. In our present study, a three-level meta-analysis is adopted because it can take such dependency into account when generating the overall effect size. We also compared the three-level model with the conventional two-level model using our current data to test whether the three-level model could explain the data better than the two-level model.

The data sets of the pre-calculated effect sizes were entered into the R software for analysis. The packages used in the present study were metafor (Viechtbauer, 2010), meta (Balduzzi et al., 2019), tidyverse (Wickham et al., 2019), and dmetar (Harrer et al., 2021). More information on the formulas, the code, and the guide for conducting a three-level meta-analysis in R can be found in Harrer et al. (2021). Appendix 3 presents information regarding the two-level and three-level meta-analysis models along with the code, packages, and steps for conducting the present meta-analysis in R.

Moderators and coding procedure

Following many other meta-analyses, we adapted three commonly used categories for the moderator investigation: publication data (e.g. publication year, publication type), population data (e.g. education level, learning context, proficiency), and treatment data (e.g. outcome measure, AWE tool, duration, feedback target, learning activity) (Lee et al., 2019). The description of the examined moderators is presented in Appendix 1. The primary studies underwent multiple coding cycles by two independent raters to ensure the inter-rater reliability of the codes. The details of the coding scheme and some pre-determined cases were discussed and agreed upon by the two raters before each rater independently coded the primary studies. The overall average kappa index of the 10 moderators was 99.16 in the between-group comparison dataset and 99.42 in the within-group comparison dataset. The few disagreements between the two raters were then discussed to determine the final codes used for the meta-analysis.
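As an illustration of the inter-rater agreement statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two raters coding the same items. It is not the authors' procedure: the function name `cohens_kappa` and the example codes are hypothetical, and the text does not state which kappa variant or scaling was used.

```python
from collections import Counter


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both raters must code the same non-empty set of items.")
    n = len(codes_a)
    # Observed agreement: share of items given the same code by both raters
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement: based on each rater's marginal code frequencies
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n ** 2
    if expected == 1.0:
        # Both raters used one identical code throughout; agreement is perfect
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for the "context" moderator in six primary studies
rater_1 = ["EFL", "EFL", "ESL", "EFL", "ESL", "EFL"]
rater_2 = ["EFL", "EFL", "ESL", "EFL", "EFL", "EFL"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

Averaging such a value over all coded moderators would give an overall index comparable in spirit to the one reported here.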


Overall effect sizes

Table 1 below shows the overall effectiveness of AWE on student writing performance compared to the traditional teaching method. The pooled effect size based on the three-level meta-analytic model was at the medium level (g = 0.59). The 95% CI (0.15; 1.04) did not cross zero, indicating a reliable effect of AWE. The result of the Q-test was significant (p < .001), implying substantial variability in the outcomes of the primary studies and the need for moderator analyses. The estimated variance values were $\tau^2_{\text{level 3}}$ = 0.72 and $\tau^2_{\text{level 2}}$ = 0.93; $I^2_{\text{level 3}}$ = 41.94% of the total variance can be attributed to between-study heterogeneity and $I^2_{\text{level 2}}$ = 54.18% of the total variance to within-study heterogeneity. The comparison of the three-level model with the conventional two-level model showed a significantly better fit for the three-level model (likelihood ratio test: χ² = 20.18; p < 0.001). Therefore, the three-level model better explains our between-group comparison data.
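The I² values reported for the two levels can be read as shares of the total variance, which in a three-level model is the sum of the level 3 variance, the level 2 variance, and a typical sampling variance. The sketch below illustrates that decomposition. The function name `variance_shares` is hypothetical, and the sampling variance of 0.0667 is an assumption backed out from the reported percentages, not a value stated in the study.

```python
def variance_shares(tau2_level3, tau2_level2, sampling_var):
    """Split total variance into percentage shares for the three levels."""
    total = tau2_level3 + tau2_level2 + sampling_var
    return {
        "level3_between_study": 100 * tau2_level3 / total,
        "level2_within_study": 100 * tau2_level2 / total,
        "level1_sampling": 100 * sampling_var / total,
    }


# Reported between-group variances; 0.0667 is an assumed typical sampling variance
shares = variance_shares(tau2_level3=0.72, tau2_level2=0.93, sampling_var=0.0667)
for level, share in shares.items():
    print(f"{level}: {share:.2f}%")
```

Under this assumption the level 3 and level 2 shares land close to the reported 41.94% and 54.18%, with the small remainder attributable to sampling error.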

Table 1. Overall average effect size and heterogeneity test results in between-group comparison.

| n | g | SE | 95% CI lower | 95% CI upper | Q | df | p | τ² level 3 | I² level 3 | τ² level 2 | I² level 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 85 | 0.59 | | 0.15 | 1.04 | | | < .001 | 0.72 | 41.94% | 0.93 | 54.18% |

Notes: ES = effect size; CI = confidence interval; n = the number of effect sizes; g = Hedges' g standardized mean differences; SE = standard error.

Table 2 below presents the overall average effect size of students' writing performance after using the AWE tools compared to their own performance before the treatment. The pooled effect size was large [g = 0.98; 95% CI = (0.63; 1.33)]. As expected, the within-group overall effect size was larger than the between-group overall effect size. Similarly, the result of the Q-test was significant (p < .001), indicating substantial variability in the outcomes of the primary studies and the need for moderator analyses. The estimated variance values were $\tau^2_{\text{level 3}}$ = 0.98 and $\tau^2_{\text{level 2}}$ = 0.09; $I^2_{\text{level 3}}$ = 88.14% of the total variance can be attributed to between-study heterogeneity, and $I^2_{\text{level 2}}$ = 8.13% of the total variance to within-study heterogeneity. A comparison of the three-level model with the two-level model was also conducted. The three-level model in this case also showed a significantly better fit (likelihood ratio test: χ² = 60.26; p < 0.001). Thus, the three-level model also better explains our within-group comparison data.

Table 2. Overall average effect size and heterogeneity test results in within-group comparison.

| n | g | SE | 95% CI lower | 95% CI upper | Q | df | p | τ² level 3 | I² level 3 | τ² level 2 | I² level 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 178 | 0.98 | 0.18 | 0.63 | 1.33 | 965.90 | 177 | < .001 | 0.98 | 88.14% | 0.09 | 8.13% |

Notes: ES = effect size; CI = confidence interval; n = the number of effect sizes; g = Hedges' g standardized mean differences; SE = standard error.
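Two of the reported quantities can be checked by hand: the confidence interval around the pooled within-group effect, and the p-values of the likelihood ratio tests. The sketch below does both under stated assumptions. It uses a normal-based interval (the fitted model may have used a t-distribution) and assumes the likelihood ratio test has one degree of freedom, since the two-level model drops a single variance component. The function names are hypothetical.

```python
import math


def wald_ci(estimate, se, z=1.96):
    """Normal-approximation 95% confidence interval for a pooled effect."""
    return estimate - z * se, estimate + z * se


def lrt_p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 df, P(X > x) equals P(|Z| > sqrt(x)) for a standard normal Z
    return math.erfc(math.sqrt(chi2 / 2))


# Pooled within-group effect and standard error as reported in Table 2
lower, upper = wald_ci(0.98, 0.18)
print(f"95% CI = ({lower:.2f}; {upper:.2f})")

# Likelihood ratio statistics reported for the two model comparisons
p_between = lrt_p_value_df1(20.18)
p_within = lrt_p_value_df1(60.26)
print(p_between < 0.001, p_within < 0.001)
```

Both p-values fall far below .001 under the one-degree-of-freedom assumption, in line with the reported results.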

Moderator analyses

In order to investigate variation within the overall effect sizes, 10 groups of moderators classified in three categories (treatment data, population data, publication data) were examined. A series of multiple meta-regressions was conducted, one for each moderator, to explore the effect size of each variable within a subgroup (moderator). The results are presented in Tables 3–5.

Treatment data

Table 3 below reports the results of the moderator analyses of the treatment data, which include five subgroups: outcome measure, tool, duration, feedback target, and activity. First, the results for outcome measure showed that AWE had a medium between-group effect size on students' overall writing (g = 0.73). In terms of the examined subcategories of writing, a large between-group effect was found in vocabulary (g = 0.83) and a medium effect in content and organization (g = 0.74) and text complexity (g = 0.58). However, a small between-group effect size was observed in the case of grammar and mechanics (g = 0.27). Data from within-group comparison showed large effect sizes in overall writing (g = 1.24) and three writing subcategories: content and organization (g = 0.88), grammar and mechanics (g = 0.86), and vocabulary (g = 0.99). The within-group effect size of text complexity was still at a medium level (g = 0.59). There was a shortage of studies on style in both between-group comparison (k = 1) and within-group comparison (k = 2).

Table 3. Moderator analyses in the treatment data.

| Treatment data | Between-group comparison, g [95% CI] | Within-group comparison, g [95% CI] |
|---|---|---|
| 1. Outcome Measure | | |
| (1) Overall Writing | 0.73 [−0.12; 1.59] | 1.24* [0.89; 1.59] |
| (2) Content & Organization | 0.74 [−0.01; 1.50] | 0.88*** [0.48; 1.28] |
| (3) Grammar & Mechanics | 0.27 [−0.51; 1.06] | 0.86 [0.56; 1.16] |
| (4) Vocabulary | 0.83 [−0.11; 1.77] | 0.99 [0.64; 1.34] |
| (5) Style | 0.29 [−1.45; 2.03] | 0.48 [−0.01; 0.97] |
| (6) Text Complexity | 0.58 [−0.35; 1.52] | 0.59 [0.29; 0.88] |
| (7) Vocabulary + Style | | 0.96 [0.05; 1.86] |
| 2. Tool | | |
| (1) Criterion | 0.34 [−0.81; 1.49] | 0.93 [−1.70; 3.56] |
| (2) Grammarly | 1.04* [−0.21; 2.29] | 1.86 [−0.91; 4.63] |
| (3) Pigai | 0.09 [−1.03; 1.20] | 0.89 [−1.79; 3.58] |
| 3. Duration | | |
| (1) Long (≥ 10 weeks) | 0.71* [0.01; 1.40] | 1.15*** [0.61; 1.69] |
| (2) Medium (3–9 weeks) | 1.13 [0.17; 2.08] | 1.07 [0.08; 2.06] |
| (3) Short (≤ 2 weeks) | −0.20 [−1.18; 0.79] | 0.87 [0.31; 1.42] |
| 4. Feedback Target | | |
| (1) Global feedback (on content & organization) | 1.47** [0.37; 2.57] | 1.24* [0.01; 2.47] |
| (2) Local feedback (on grammar & vocabulary) | 0.78 [−0.51; 2.06] | 0.98 [−0.39; 2.34] |
| (3) Mixed feedback | 0.19* [−1.03; 1.41] | 0.95 [−0.38; 2.28] |
| 5. Activity | | |
| (1) Alone | 0.52 [−0.04; 1.08] | 1.10*** [0.65; 1.54] |
| (2) With peer | 1.04 [−0.30; 2.37] | 0.89 [−0.70; 2.48] |
| (3) With teacher | 0.53 [−0.32; 1.38] | 0.60 [−0.38; 1.59] |

Notes: CI = confidence interval; n = the number of effect sizes; k = the number of studies; g = Hedges' g standardized mean differences; *p < .05; **p < .01; ***p < .001.

Table 4. Moderator analyses in population data.

| Population data | Between-group comparison, g [95% CI] | Within-group comparison, g [95% CI] |
|---|---|---|
| 6. Education Level | | |
| (1) Secondary | 0.25 [−2.35; 2.86] | 0.51 [−1.32; 2.35] |
| (2) Undergraduate | 0.62 [−1.53; 2.77] | 1.05 [−0.49; 2.59] |
| (3) Post-graduate | 0.40 [−3.00; 3.80] | 0.96 [−0.53; 2.44] |
| (4) Institute | | 1.15 [−0.94; 3.23] |
| 7. Context | | |
| (1) EFL | 0.62** [0.16; 1.09] | 0.98*** [0.58; 1.37] |
| (2) ESL | 0.38 [−0.74; 1.50] | 1.02 [0.02; 2.02] |
| 8. Proficiency | | |
| (1) Basic | 1.00 [−2.44; 4.44] | 2.85* [1.42; 4.29] |
| (2) Intermediate | 1.03 [−1.92; 3.98] | 1.30 [1.12; 1.49] |
| (3) Advanced | 2.00 [−0.81; 4.81] | 1.25*** [0.62; 1.88] |
| (4) Mixed | 0.23 [−2.73; 3.20] | 0.55 [−0.65; 1.75] |
| (5) Basic + Intermediate | | 0.60 [−0.84; 2.04] |

Notes: CI = confidence interval; n = the number of effect sizes; k = the number of studies; g = Hedges' g standardized mean differences; *p < .05; **p < .01; ***p < .001.

Second, regarding the moderator analysis of the tools, 18 different AWE tools were investigated in the literature. Therefore, we decided to report the results of the three most commonly used tools (i.e. Criterion, Grammarly, Pigai), each of which was examined in more than one study. The list of all the AWE tools with their key functions and effect sizes is presented in Appendix 4. The moderator analysis on tools showed a small between-group effect size for Criterion (g = 0.34) and a large between-group effect size for Grammarly (g = 1.04). Pigai did not show a differential effect on improving student writing performance compared to the traditional teaching method (g = 0.09). In within-group comparison data, the effect size of Grammarly (g = 1.86) was twice as large as those of the other tools (i.e. Criterion, Pigai).

Third, the analysis on duration indicated a large between-group effect size in medium duration (g = 1.13) and a medium effect size in long duration (g = 0.71). Short duration produced a small negative between-group effect size (g = −0.20). Data from within-group comparison revealed large effect sizes in long duration (g = 1.15), medium duration (g = 1.07), and short duration (g = 0.87).

Table 5. Moderator analyses in publication data.

| Publication data | Between-group comparison, g [95% CI] | Within-group comparison, g [95% CI] |
|---|---|---|
| 9. Publication Type | | |
| (1) SSCI/ESCI journal | 0.39 [−1.11; 1.88] | 1.01 [0.04; 1.97] |
| (2) General journal | 0.91 [−0.61; 2.44] | 0.75 [−0.33; 1.84] |
| (3) Conference paper | 0.69 [−0.64; 2.02] | 1.42 [0.62; 2.23] |
| (4) Dissertation/Thesis | 0.17 [−1.87; 2.21] | 0.25 [−1.47; 1.97] |
| 10. Publication Year, β [95% CI] | 0.04 [−0.05; 0.12] | 0.09 [−0.01; 0.19] |

Notes: CI = confidence interval; n = the number of effect sizes; k = the number of studies; g = Hedges' g standardized mean differences; *p < .05; **p < .01; ***p < .001.

Fourth, concerning feedback target, studies utilizing global feedback offered by the AWE tools resulted in a large between-group effect size (g = 1.47), while the effect was at a medium level (g = 0.78) in studies targeting local feedback. The mixed feedback type from the AWE tools did not outperform the traditional teaching method in improving student writing performance, since the effect size was negligible (g = 0.19). The within-group effect sizes were large in all three types of feedback target: global feedback (g = 1.24), local feedback (g = 0.98), and mixed feedback (g = 0.95).

The last subgroup analysis from the treatment data was activity. The results showed a large between-group effect size of AWE intervention when students learned with their peers (g = 1.04). The effect sizes were medium in cases of learning with teachers (g = 0.53) or alone (g = 0.52). In the case of within-group comparison, learning with teachers only produced a medium effect size (g = 0.60), which was the lowest compared to the large effect sizes produced in learning alone (g = 1.10) and learning with peers (g = 0.89).
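The verbal labels attached to the effect sizes in this section (negligible, small, medium, large) appear to follow Cohen's conventional benchmarks. The sketch below encodes that reading. The cut-offs of 0.2, 0.5, and 0.8 are an assumption inferred from how the reported values are described, not thresholds stated in the text, and the function name `label_effect_size` is hypothetical.

```python
def label_effect_size(g):
    """Map a Hedges' g value to a verbal label using assumed Cohen-style cut-offs."""
    magnitude = abs(g)  # the sign gives direction, not size
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Between-group values reported in this section
reported = {
    "overall": 0.59,
    "vocabulary": 0.83,
    "grammar and mechanics": 0.27,
    "Pigai": 0.09,
    "short duration": -0.20,
}
for name, g in reported.items():
    print(f"{name}: g = {g:+.2f} -> {label_effect_size(g)}")
```

With these cut-offs the labels agree with the wording used for the values above, including the small negative effect for short duration.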

Population data

Table 4 presents moderator analyses from population data including education level, context, and proficiency. Regarding education level, a majority of past research focused on the undergraduate level, and the effect size was medium in between-group comparison (g = 0.62) and large in within-group comparison (g = 1.05). Several attempts made at the secondary school level showed a preliminary small between-group effect size (g = 0.25) and a medium within-group effect size (g = 0.51). There was a lack of research on other levels, with fewer than three studies each.

In terms of context, past research mostly investigated the effectiveness of AWE with EFL students. The result indicated a medium between-group effect size (g = 0.62) and a large within-group effect size (g = 0.98) in EFL students. A few investigations on ESL students showed a small between-group effect size (g = 0.38), while the effect size was still large (g = 1.02) in the case of within-group comparison.

Lastly, concerning proficiency data, intermediate and mixed proficiency levels were the two most investigated populations in the literature. While a group of students of intermediate proficiency largely benefited from AWE (g = 1.03), a group of mixed proficiency levels only received a relatively small effect (g = 0.23) in comparison to the effect from traditional teaching. Data from within-group comparison showed a large effect size at the intermediate level (g = 1.30), and the effect was medium at the mixed level (g = 0.55). There was a shortage of studies at the basic proficiency level (k = 1 in both between- and within-group comparisons). Also, relatively few studies attempted to compare the effect of AWE with traditional teaching in students of advanced proficiency (k = 1). Three within-group studies found for the meta-analysis at the advanced level produced a large pooled effect size (g = 1.25).

Publication data

Table 5 presents the results of the moderator analyses of the publication data, which include publication type and publication year. For publication type, the moderator analysis showed that studies published in high-impact journals (e.g. SSCI/ESCI) tended to report a small between-group effect size (g = 0.39) of AWE, while those published in more general or lower-impact journals reported a large between-group effect size (g = 0.91). Other publication sources, such as conference papers and dissertations/theses, showed a different tendency: the reported effect sizes were medium (g = 0.69) and negligible (g = 0.17) respectively. The within-group effect sizes were large in SSCI/ESCI journals (g = 1.01) and conference papers (g = 1.42), medium in general journals (g = 0.75), and small in dissertations/theses (g = 0.25). Publication year did not change the overall relationship between AWE and student writing performance in either the between-group (β = 0.04, p > .05) or the within-group (β = 0.09, p > .05) comparison.


Our meta-analysis uncovers some variation in the effectiveness of AWE on student writing performance. The current data set, comprising 85 between-group effect sizes from 24 studies and 178 within-group effect sizes from 34 studies, enabled us to account sufficiently for this variation. In this section, we attempt to present the overall picture by addressing the two research questions. Below are the main findings of the present meta-analysis:

  1. AWE has positive treatment effects on student writing compared to both non-treatment and non-AWE treatment conditions.

  2. AWE had the most consistent and large effect on vocabulary, which refers to the word usage in writing. However, it had the smallest effect on grammar (language use) and mechanics (spelling and punctuation).

  3. Grammarly's performance indicated it to be the most efficient tool in assisting writing, while Pigai did not show the expected effect.

  4. Medium and long duration of AWE treatment (i.e. more than two weeks) showed higher impact on writing outcome compared to non-AWE treatment conditions, but short duration (i.e. less than or equal to two weeks) showed lower impact.

  5. Studying with peers in the AWE condition potentially produced the largest effect.

  6. Current AWE instruction is more beneficial to undergraduate students but less to secondary students.

  7. AWE is more beneficial to EFL students than ESL students.

  8. AWE showed a large effect on students at the intermediate proficiency level.

How effective is AWE?

With a medium overall between-group effect size (g = 0.59) and a large overall within-group effect size (g = 0.98), AWE shows positive effects in improving student writing performance. These outcomes support the integration of AWE into the writing classroom. However, the variation in the effect sizes indicates that there is still room for AWE to develop, especially when compared to the traditional teaching method. Some suggestions can be offered by examining the results of the moderator analyses, as discussed in the following section.
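The labels used throughout this section (negligible, small, medium, large) appear consistent with Cohen's conventional benchmarks of 0.2, 0.5, and 0.8. As a minimal sketch, assuming those cut-offs apply (the thresholds and the function name are illustrative assumptions, not a rule stated in this section), the two overall effect sizes could be labelled as follows:

```python
def classify_effect_size(g: float) -> str:
    """Label |g| using Cohen's conventional benchmarks (assumed: 0.2, 0.5, 0.8)."""
    magnitude = abs(g)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Overall effect sizes reported in this meta-analysis.
overall = {"between-group": 0.59, "within-group": 0.98}
labels = {name: classify_effect_size(g) for name, g in overall.items()}
print(labels)
```

Under these assumed cut-offs, the same function can be applied to any moderator-level estimate, which makes it easy to check that a verbal label matches the reported value.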

How can any observed variation be accounted for?

It is essential for any meta-analytic study to investigate "what works for whom in what circumstances and in what respects, and how" (Pawson & Tilley, 2004, p. 151). In other words, it requires researchers to identify the degree of contribution of different moderator variables to the overall effects (Boulton & Cobb, 2017). Taking these viewpoints into account, the discussion below addresses: (1) in what respects AWE can be beneficial; (2) which AWE tools are more efficient; (3) how to use AWE more effectively; and (4) for whom AWE usage is more effective. In addition, the variation observed in the publication data (e.g. publication type, publication year) and the design of post-tests in the collected studies are discussed to understand potential publication bias and design-related issues.

First, the findings show the high benefit of AWE in improving vocabulary (e.g. word choice/usage) in writing (between-group g = 0.83; within-group g = 0.99). This suggests that AWE can provide feedback on various vocabulary options that students can use to enrich their writing (Shang, 2019). Moreover, according to many teachers and scholars, learning new vocabulary may be less challenging than learning grammar (Coady & Huckin, 1997).

An unexpected finding is the small between-group effect size of AWE on grammar and mechanics (g = 0.27). Grammar refers to the language use in writing, such as whether a sentence is used correctly, and mechanics relates to word spelling and punctuation. To explore the possible reasons for the low effect, an examination of the primary studies revealed that AWE produced negligible or small effect sizes when the studies employed less efficient tools such as Pigai and Jukuu (located at and was a part of Pigai) (see, e.g. Hu & Zhao, 2015; Shang, 2019). Confirmation of the results through the present meta-analysis indicated that the negative effect sizes occurred when studies had a short (see, e.g. Gao & Ma, 2019; Liu et al., 2017) or medium intervention time (see, e.g. Choi, 2010; Choi, 2011). Large effect sizes of AWE were observed in studies that focused solely on mechanics (see, e.g. Choi, 2011) or in studies with a long intervention time that used tools other than Pigai or Jukuu (see, e.g. Barrot, 2021; Ghufron & Rosyida, 2018; Wang et al., 2013). These findings provide sufficient information to conclude that using adequate AWE tools for a longer period is required to improve students' grammar in writing. This also aligns with the aforementioned challenges that, according to many scholars, students face in learning grammar.

Second, regarding the efficacy of different AWE tools, Grammarly emerged as the most effective in improving students' writing performance. In agreement with O'Neill and Russell (2019), some explanations for the effectiveness of Grammarly could be (1) its ability to offer teachers the opportunity to use both indirect feedback (e.g. highlighting the errors) and direct feedback (e.g. giving explicit corrections), (2) its ability to provide not only extensive feedback (e.g. the program can determine 250 breaches in grammatical rules) but also focused feedback (e.g. providing cues to address high-frequency errors), and (3) its easy-to-use system. However, a limitation of Grammarly is that it does not offer global writing feedback (e.g. feedback on content and organization, feedback at the discourse level). In this case, Criterion can be an option for providing global feedback, since it can produce a slightly higher effect than the traditional teaching method. Unlike Grammarly, Pigai did not show any superior effect to traditional teaching. As studied by Gao (2021), Pigai could not diagnose essays as well as teachers did. The system is only able to diagnose word-related errors and inadequately identifies language errors in other aspects, and the system's suggestions about syntactic use are also lacking in quality.

Third, the question of how to use AWE more effectively is discussed in three respects: duration, feedback target, and activity. In the duration moderator analysis, short duration (i.e. less than or equal to two weeks) does not offer a desirable outcome in comparison to traditional teaching (g = 0.20). This can be explained from the perspective of DeKeyser's (2007) skill acquisition theory. Initially, the AWE feedback system makes students aware of grammatical or other writing-related rules (i.e. presentation of declarative knowledge). Then, students are offered opportunities to internalize the AWE feedback through multiple cycles of revising their written texts (i.e. the practice of procedural skills). According to the theory, a specific language skill, such as producing new texts, requires a longer time and more practice of procedural skills to become automatic (Liao, 2016b). Therefore, it is understandable that the effect of a short duration is much less observable.

Another possible interpretation of the low effect of short-duration studies is that students lacked sufficient time to learn to use the new AWE tools effectively. Among the short-duration studies collected in the present meta-analysis, most did not report a tool-training session; the exception was Xie et al. (2020), whose training was relatively short, lasting only 30 minutes. In addition, students' reflection on their degree of familiarity with the AWE tools was not reported. Therefore, while tool-learning time may be a confounding variable in the effectiveness of AWE, it may also be that the tools are easy to use or that students were already familiar with them, so that no substantial training was needed. Since past short-duration studies offer little evidence on the potential impact of tool-learning time, future investigation may be necessary. Researchers can incorporate qualitative investigation of students' perceptions of whether any difficulties were encountered when using the AWE tools, or observe how much time students need to use the tools proficiently. Proper software training time can then be allotted to better facilitate students' learning process, and a suitable pre-training session can be designed for future short-duration studies of the effectiveness of AWE tools. This can control for the confounding effect of students' proficiency in using the new tools.

In contrast, a medium duration produces a large effect on writing outcomes in comparison with traditional teaching (g = 1.13), and a long duration produces a medium effect (g = 0.71). From the viewpoint of the skill acquisition theory discussed above, it is not surprising that medium and long durations show a higher impact on writing outcomes. The question is why long duration has a lower impact than medium duration. Returning to the discussion of the effectiveness of different AWE tools, one possible explanation is that the tool, rather than the duration itself, is a significant moderator that lowers the effect in long-duration studies. Data from the collected studies show that 14 out of 25 effect sizes of long-duration between-group studies come from studies using Pigai and Jukuu (a part of Pigai), while these two inefficient tools are not used in any of the medium-duration studies.

In terms of feedback target, high efficacy is found when either global feedback (e.g. feedback on content and/or organization) or local feedback (e.g. feedback on vocabulary and/or grammar) from the AWE tools is applied. When a mixed type of feedback (i.e. feedback on both global and local levels) is used, the effect size is found to be negligible (g = 0.19) compared to traditional teaching. Again, we believe the low effect is caused by the AWE tools. Among 11 studies with 55 effect sizes on mixed feedback, 5 studies with 29 effect sizes used Pigai and Jukuu. The only two studies that produced a large effect size with mixed feedback used My Access! and WhiteSmoke. Criterion's effects ranged from negative to large depending on the learning context, and the negative effects only occurred in the ESL context (Choi, 2010, 2011). To support our interpretation, we performed a multimodel inference of 10 examined moderators in the between-group comparison data. The result indicated that the most significant and important predictor of the effect was the tool, with a coefficient estimate of 0.91, while the other moderators had coefficient values lower than 0.50 (see Appendix 2). Therefore, it can be inferred that the effect of AWE on writing will be significantly influenced by a change in the type of AWE tool.

Regarding activity, a few studies investigating learning with peers showed an initial overall large effect in both between- and within-group comparisons (g = 1.04; g = 0.89 respectively). This could be a promising finding for many teachers because peer review activities with the help of AWE can reduce teachers' workload and shorten the time needed to deliver feedback to students (Huang & Renandya, 2020). However, further AWE studies with peer review activities are needed to verify our current findings. Most of the collected studies had students use AWE tools independently, and the results revealed a medium effect size in the between-group comparison (g = 0.52). Nevertheless, the within-group effect is both large and significant (g = 1.10, p < .001). This indicates a consistently high efficacy of AWE when students study alone, regardless of the context. Teacher review activity with the help of AWE seems to be the least efficient compared to peer review and independent study, though the effect is sufficient (i.e. at the medium level) in both between-group (g = 0.53) and within-group (g = 0.60) comparisons. Many scholars agree that writing teachers may encounter great difficulty instructing students when the class size is large (Chen et al., 2017; Huang & Renandya, 2020; Warschauer & Ware, 2006). Students are therefore advised to seek help from other sources (e.g. peers) or to learn gradually to become independent learners within an AWE learning condition.

Fourth, in answering the question of for whom AWE usage is most effective, we looked at the education level, context, and proficiency of the students. The data showed that the majority of AWE studies focused on the undergraduate education level (19/24 and 28/34 studies respectively in between- and within-group comparisons) and the EFL context (22/24 and 29/34). Therefore, the effect sizes of these two types of population data were similar to the overall effect sizes (i.e. a medium effect size in the between-group comparison and a large effect size in the within-group comparison). The variation in these two effects should also be similar to our previous discussion of the overall results. A point to consider is the overall small between-group effect size (g = 0.25) and medium within-group effect size (g = 0.51) from a few studies on secondary school students. It is likely that secondary school students do not benefit as much as students at other levels (e.g. undergraduate students). Furthermore, AWE seems to have a larger impact on EFL than ESL students in between-group comparisons (g = 0.62 as opposed to g = 0.38). Further research on secondary school and ESL students is needed to verify these preliminary findings.

Regarding English proficiency level, most of the primary studies investigated students with intermediate or mixed proficiency, a population commonly found at the undergraduate level. Only studies with students at the intermediate level showed large effect sizes in both between- and within-group comparisons (g = 1.03, g = 1.12 respectively). However, a small between-group effect size (g = 0.23) and a medium within-group effect size (g = 0.55) are present in the mixed English level class. Notably, all of the studies targeting the mixed proficiency level also employed mixed feedback from the AWE tools, except for two studies contributing two effect sizes (Hoang, 2019; Jayavalan & Razali, 2018). Similar to the discussion in the feedback target section, we hold the view that the lower effects may be caused by the AWE tools and not necessarily by the mixed type of feedback or English proficiency.

In order to identify whether any potential publication bias may be present in the meta-analysis, we conducted moderator analyses on the publication data (e.g. publication type, publication year). Although there is no significant difference among the variables in the publication type (p > 0.05), a distinct between-group effect size can be observed between SSCI/ESCI category journals and general journals. The former showed a small overall effect size (g = 0.39), while the effect is large in the latter (g = 0.91). In our data collection, the studies published in SSCI/ESCI journals with mostly negative reports of AWE typically employed Pigai (15/34 between-group effect sizes) and self-established AWE systems (11/34 between-group effect sizes). In contrast, two long-duration studies using effective tools (e.g. Grammarly, Criterion) showed large between-group effect sizes of AWE (see, e.g. Barrot, 2021; Hassanzadeh & Fotoohnejad, 2021). Hence, the selected tools, rather than the publication type, may have resulted in the low effect found in SSCI/ESCI journals. Moderator analyses on publication year indicated that the overall AWE effect sizes did not change with year of publication (between-group β = 0.04; within-group β = 0.09; p > 0.05 in both cases). Therefore, our analyses of the publication data indicate that publication bias does not have a clear impact on our current data.

Our final discussion is on the design of the post-tests in the studies collected for the present meta-analysis. One common aspect is that all the post-tests are direct writing production tests, meaning participants need to write paragraphs or essays on the given topics. This is an essential characteristic of the post-tests, since the aim of the present study is to examine the effectiveness of AWE on students' writing outcomes. However, there are some variations in the design of the post-tests.

Firstly, while approximately half of the collected studies share the same writing topics for both pre- and post-tests, others provide different topics for the pre- and post-tests but with a similar topic genre (e.g. expository genre). Having the same writing topics for both pre- and post-tests might produce more comparable texts but face a higher pre-test sensitization effect, as students' cognitive gains would be largest with similar tests (Willson & Putnam, 1982).

Secondly, most of the primary studies used intervention-related, researcher-developed tests as post-tests. A few studies used writing topics from standardized tests, but participants were already familiar with those topics because they were taught the same topic genre during their learning process. The topics of the standardized tests may also be embedded in the AWE tools, such as Criterion and Pigai, used in their practice (see, e.g. Chang et al., 2021; Hoang, 2019; Huang & Renandya, 2020; Li et al., 2015; Wang, 2013). Hence, a practice effect may have played a role in the primary studies. The practice effect might not be prominent in the between-group comparison because the control groups needed to write on the same topics as the experimental groups. The effect sizes found in the within-group comparison should be interpreted with caution, but the comparison among studies should be plausible, as a practice effect is present in all the primary studies.

Lastly, most of the primary studies' post-tests were conducted in class or under laboratory conditions, while a few of the studies allowed participants to complete the post-test writing after class (see, e.g. Liu et al., 2017; Lu & Li, 2016; Ohta, 2008; Xie et al., 2020). Indeed, studies having participants complete the post-tests outside of the laboratory condition carry high risks to internal validity.

However, whether all study variables should be controlled under the laboratory condition can be a challenging decision for social science researchers, because an attempt to strengthen internal validity would also weaken external validity and vice versa (Nunan, 1992). In other words, what occurs in the laboratory condition may not occur under typical circumstances (e.g. in a general classroom where some teachers assign writing as students' homework). Therefore, we decided to acknowledge the contribution of both conditions and include all of those studies in the analysis. Furthermore, because studies having students complete the post-tests after class accounted for a small proportion of the current meta-analysis (4 out of 44 collected studies), their influence on the overall effect sizes should be minimal.

For a more thorough meta-analysis, it could be useful to conduct moderator analyses on the design data to explore potential varying effects caused by the study design. For example, following our discussion on the design of the post-tests, it may be beneficial to examine whether effect sizes differ between studies with the same writing topics and studies with different writing topics in the pre- and post-tests. Exploring studies with post-tests conducted in different contexts (under the classroom/laboratory condition or outside the classroom) can be useful as well. Nevertheless, the goal of our present meta-analysis is to address possible pedagogical implications of AWE for EFL/ESL writing performance. Future researchers who wish to contribute to the methodological implications could pay more attention to including design data variables as moderators in a meta-analysis.


A noticeable limitation of our study is that we did not examine the effects of missing data or biases in our meta-analysis by conducting popular tests such as Egger's regression test, fail-safe N tests, or the trim-and-fill method. Similar to Assink and Wibbelink (2016), we found no evaluation or official guideline of the available methods for managing missing data or biases in a three-level meta-analytic study. A possible alternative is to apply the available methods by examining the biases in the same manner as in a conventional two-level meta-analysis. However, the overall effect sizes from the two models (i.e. three-level and two-level models) were different. In our present meta-analysis, the difference is also significant, as stated in the results section. Therefore, approaches used for a conventional two-level meta-analysis might not appropriately explain or handle our missing data.
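For readers unfamiliar with Egger's regression test, the following minimal sketch shows its conventional two-level form: each effect size divided by its standard error is regressed on precision (the inverse of the standard error), and an intercept far from zero suggests funnel-plot asymmetry. The data and the function name are hypothetical, the significance test for the intercept is omitted, and this procedure was not applied in the present meta-analysis.

```python
from statistics import mean


def egger_intercept(effects, std_errors):
    """Return (intercept, slope) of the OLS regression of g/SE on 1/SE."""
    z = [g / se for g, se in zip(effects, std_errors)]  # standardized effects
    x = [1.0 / se for se in std_errors]                 # precision
    x_bar, z_bar = mean(x), mean(z)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxz = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = z_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: every study estimates g = 0.5 and differs only in precision,
# so the regression line passes through the origin (no asymmetry).
effects = [0.5, 0.5, 0.5, 0.5]
std_errors = [0.10, 0.20, 0.25, 0.40]
intercept, slope = egger_intercept(effects, std_errors)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Because this regression treats every effect size as independent, it ignores the nesting of effect sizes within studies, which is one reason its suitability for a three-level model is uncertain.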

Another alternative is to calculate the average effect size of each independent study, then use all the calculated average effect sizes as the dataset for handling biases or missing data. By nature, this method would create more bias, as there were already potential biases present among the different effect sizes within a study. Harrer et al. (2021) also presented outcome reporting bias as one of the sources of bias in meta-analysis. The authors contended that many studies with multiple outcomes would drop the negative findings from their reports and keep only the positive findings, thereby producing outcome reporting bias. In sum, including all the different effect sizes within a study can better account for the missing data than using only the average effect size of an entire study.
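The trade-off described above can be illustrated with a small sketch. The study labels and effect sizes are invented for illustration only; the point is that a per-study average hides the spread among the effect sizes within a study, which a three-level model retains.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (study, g) pairs, invented for illustration only.
effect_sizes = [
    ("study_A", 1.20), ("study_A", -0.10), ("study_A", 0.40),
    ("study_B", 0.60), ("study_B", 0.70),
]

# Group the effect sizes by the study they come from.
by_study = defaultdict(list)
for study, g in effect_sizes:
    by_study[study].append(g)

# Averaging keeps one value per study and discards within-study variation.
study_means = {s: round(mean(gs), 2) for s, gs in by_study.items()}
# The range shows how much variation the average conceals.
study_ranges = {s: round(max(gs) - min(gs), 2) for s, gs in by_study.items()}

print(study_means)
print(study_ranges)
```

In this invented example the two studies have similar averages, yet one of them contains a negative effect size that disappears once the values are averaged.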

To conclude, due to the lack of available statistical methods that can directly identify publication bias (Harrer et al., 2021) and the lack of evaluation of the available methods in a three-level meta-analysis (Assink & Wibbelink, 2016), we were more interested in analyzing and interpreting the effect sizes found in our current data. Potential publication bias may be indirectly observed from our moderator analysis of the publication data. Handling the missing data through a three-level meta-analysis remains our current limitation.

Conclusion and future directions

The present study attempts to meta-analyze the studies on the effectiveness of AWE on EFL/ESL student writing performance. Overall, AWE has a medium effect size compared to traditional teaching and largely facilitates within-subject development. The findings support the application of AWE in the writing classroom. The present study also provides other suggestions on how best to apply AWE to writing instruction. First, AWE is indeed effective in improving students' use of vocabulary, but it may need a longer time frame to improve students' grammar in writing. Second, the choice of tool should be carefully considered, as it would significantly affect the learning outcome. Third, the duration of AWE use should be medium to long term. A short duration (i.e. less than 3 weeks) would have little to no effect. Fourth, peer review activities can be encouraged to better facilitate students' writing performance. Finally, the current AWE application is more beneficial to students at the undergraduate level, students studying in an EFL context, and students with an intermediate proficiency level.

Some directions for future investigation can be gathered from this meta-analysis. First, we noticed that past studies on AWE rarely designed delayed post-tests to examine the retention effect of AWE on writing performance. Future experimental researchers are recommended to design delayed post-tests so that a more comprehensive picture of the effectiveness of AWE can be discussed. Second, tools that provide both global and local feedback, such as My Access! or WhiteSmoke, are worth investigating in future research. As previously discussed, Grammarly is an effective tool but is limited to providing local feedback. Criterion and Pigai, which can provide both global and local feedback, did not show an overall large effect. However, two studies examining the effectiveness of My Access! (Khoii & Doroudian, 2013) and WhiteSmoke (Toranj & Ansari, 2012) produced large effect sizes compared to traditional teaching. This calls for future investigations to verify these findings. Third, future research on AWE can design peer review activities so that the evidence on the effectiveness of studying with peers in an AWE condition can be more conclusive. Moreover, this will provide further evidence for teachers in deciding on learning activities for students.

Another direction can be the investigation of the differential effectiveness of direct versus indirect feedback or extensive versus focused feedback produced by the tools. In the present meta-analysis, we were not able to examine these aspects due to the lack of relevant information in the primary studies. For example, only two studies reported the use of indirect feedback (see, e.g. Choi, 2011; Liu et al., 2017), while many papers only reported the target writing aspects (e.g. vocabulary, grammar, organization, content) that the tools offered, without any indication of whether the feedback was indirect, direct, or both. Similarly, there was not much information regarding extensive and focused feedback in the primary studies. These levels of feedback are worth investigating in future studies to contribute to the effective development of AWE tools.

Some other areas that lack investigation, such as the effect of AWE on writing style or on students with basic or advanced English proficiency, can be further explored in the future as well. Lastly, regarding research methodology, a worthy direction for future research is to evaluate the available methods for handling missing data or biases in a three-level meta-analysis. Furthermore, considering design data variables as moderators in a meta-analysis could also contribute to the understanding of the effects of different research designs.

Disclosure statement

No potential conflict of interest was reported by the author(s).


This work was supported by the Ministry of Science and Technology in Taiwan [grant number MOST110-2923-H-003-002-MY2].

Notes on contributors

Thuy Thi-Nhu Ngo is a doctoral student of the English Department at National Taiwan Normal University, Taipei, Taiwan. Her research interests include computer-assisted language learning and meta-analysis.

Howard Hao-Jan Chen is a Distinguished Professor of the English Department at National Taiwan Normal University, Taipei, Taiwan. Professor Chen has published several papers in CALL Journal, ReCALL Journal, and several related language learning journals. His research interests include computer-assisted language learning, corpus research, and second language acquisition.

Kyle Kuo-Wei Lai is a doctoral student of the English Department at National Taiwan Normal University, Taipei, Taiwan. His research interests include computer-assisted language learning and digital game-based language learning.




